class: center, middle, inverse, title-slide .title[ # Class 2d: Review of concepts in Probability and Statistics ] .author[ ### Business Forecasting ] --- <style type="text/css"> .remark-slide-content { font-size: 20px; } </style> --- layout: false class: inverse, middle # Methods of Qualitative Forecasting --- ### Delphi Method - A structured communication process to reach a consensus for complex, uncertain and long-term forecasting tasks 1. Select a group of experts 2. Invite them to the study. They are anonymous and don't talk to each other! 3. Ask them to answer a questionnaire 4. Get initial responses 5. Compile them into a summary 6. Send them the summary and get their feedback with refined answers 7. Iterate until consensus is reached or there is no further improvement -- #### Example: Determining AI threats - What are the risks of AI developments? - Panel of experts from academia and industry - Computer scientists, engineers, CEOs of AI companies, ethics experts - Send them questionnaires asking about potential threats - Compile responses into a summary and send it back - Get more rounds of responses until consensus - Identify the most probable risks --- ### Brainstorming - Creative technique for generating ideas. - Encourages free thinking and building on suggestions. - Appropriate for exploring possibilities. - Form a group (no need for experts) - State the problem - Encourage ideas, no matter how crazy - Build on and combine each other's ideas - Document the ideas and synthesize them -- #### Example: Enhancing Employee Engagement - Tech company's HR department. - Representatives from HR, IT, and different departments. - Generate ideas for a mobile app to enhance employee engagement. - Write them down and implement the relevant ones --- ### Panel of Experts - Assemble knowledgeable individuals - At the same time and place - They meet, offer insights and expertise, and discuss - Aid in well-informed decisions. 
- Sometimes ends with a report containing conclusions -- #### Example: Environmental Policy Formulation - A government agency wants to identify and address the most pressing environmental issues - Environmental scientists, economists, conservationists, and policymakers. - Discuss policy options. - Create comprehensive environmental policies. --- ### Focus Groups - Gather diverse participants - not necessarily experts - Share perceptions, attitudes, and opinions. - Provide qualitative data and consumer insights. -- #### Example: Market Research for a New TV Show - Proposing a new TV show and trying to see how well it will do - Participants from various demographics. - Understand consumers' preferences and perceptions about the TV show - Fine-tune the product and marketing strategy. --- --- layout: false class: inverse, middle # Types of Data --- ### Longitudinal Data - Observations are collected for the same subject (entity) over a period of time - Same as time series data - Example: Tracking a company's annual revenue and number of employees over several years #### Longitudinal Data Example
- Another Example: Share of people with Diabetes in Mexico in years 2010, 2015, 2020 --- ### Cross-Sectional Data - Observations are collected at a single point in time - Example: A survey of customers' satisfaction with a product and likelihood of repurchase at a certain point in time #### Cross-Sectional Data Example
- Another Example: Share of people with Diabetes in 2010 in Mexico, USA, Canada, Brazil --- ### Panel Data - Combines both longitudinal and cross-sectional data - Observations are collected for multiple subjects over multiple points in time - Example: Tracking the annual revenue and number of employees of several companies over a few years #### Panel Data Example
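A minimal R sketch of how the three data types relate to each other. The data frame `panel` and its columns `country`, `year` and `share` are hypothetical, and the `share` values are random placeholders rather than real diabetes statistics.

``` r
# Hypothetical panel: 4 countries x 3 years.
# 'share' holds random placeholders, NOT real diabetes figures.
set.seed(1)
panel <- expand.grid(
  country = c("Mexico", "USA", "Canada", "Brazil"),
  year    = c(2010, 2015, 2020),
  stringsAsFactors = FALSE
)
panel$share <- round(runif(nrow(panel), min = 0.05, max = 0.20), 3)

# Cross-sectional slice: many subjects, a single point in time
cross_section <- subset(panel, year == 2010)

# Longitudinal slice: a single subject, many points in time
longitudinal <- subset(panel, country == "Mexico")

print(panel)
print(cross_section)
print(longitudinal)
```

Seen this way, a panel is a stack of cross-sections (one per year) or, equivalently, a stack of longitudinal series (one per country).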
- Another Example: Share of people with Diabetes in Mexico, USA, Canada, Brazil, each country in years 2010, 2015, 2020 --- ## Q1
-- **Panel data** - Multiple time observations per subject (currency) and multiple subjects --- ## Q2
-- **Cross-sectional data** - Single (time) observation per subject (user), many subjects --- ## Q3
-- **Longitudinal data** - Multiple (time) observations of a single subject --- layout: false class: inverse, middle ## Variable Types --- ## Variable Types We have two general types: .blue[Categorical] and .blue[Numerical] variables ### Categorical Variables - Variables that can be divided into one or more groups or categories. - **Ordinal:** These variables can be logically ordered or ranked. - *Variable:* Customer Satisfaction Survey Results - *Example:* Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied - **Nominal:** These variables cannot be ordered or ranked. - *Variable:* Social Media Platforms Used - *Example:* Facebook, Instagram, Twitter, LinkedIn, TikTok, Snapchat --- ### Numerical Variables - Variables that hold numeric value and ordering is possible - **Discrete:** These variables can only take certain values - *Example*: Number of App Downloads from App Store - *Example*: Number of children you have - *Example*: Size of coke products: 0.33L, 0.5L, 1L, 2.25L <center> <img src=coke_sizes.jpg width="500"> </center> --- ### Numerical Variables - Variables that hold numeric value and ordering is possible - **Discrete:** These variables can only take certain values - *Example*: Number of App Downloads from App Store - *Example*: Number of children you have - *Example*: Size of coke products - **Continuous:** These variables can take any value within a range - *Example*: Time spent on a Webpage - *Example*: Exchange rate between MXN and USD -- - What's the main difference between ordinal and discrete? - We could say 1=Very unsatisfied, 2=Unsatisfied - But we cannot say that very unsatisfied has half of satisfaction of person who is just unsatisfied! - We can order, but these numbers don't have meaning in terms of distance between them --- ### Mexican Health Survey - Representative sample of the Mexican population .red[n=37858]. - We will use it to investigate market for Ozempic
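The variable types above map onto R's basic types roughly as follows. This is only a sketch: the data frame `survey` and all of its values are invented for illustration, reusing the examples from the earlier slides.

``` r
# Invented example values; only the types matter here.
survey <- data.frame(
  satisfaction = factor(
    c("Neutral", "Very Satisfied", "Unsatisfied"),
    levels = c("Very Unsatisfied", "Unsatisfied", "Neutral",
               "Satisfied", "Very Satisfied"),
    ordered = TRUE                       # ordinal: categories have a rank
  ),
  platform  = factor(c("Instagram", "TikTok", "Facebook")),  # nominal
  downloads = c(120L, 45L, 300L),        # discrete: integer counts
  minutes   = c(3.75, 0.42, 12.10)       # continuous: any value in a range
)

str(survey)

# Ordering is meaningful for an ordinal variable...
is_lower <- survey$satisfaction[1] < survey$satisfaction[2]  # Neutral < Very Satisfied
is_lower

# ...but distances are not: R refuses to average categories and returns NA.
avg_satisfaction <- suppressWarnings(mean(survey$satisfaction))
avg_satisfaction
```

Storing ordinal answers as an ordered factor rather than as the numbers 1 to 5 keeps R from silently treating the codes as if the gaps between them were meaningful.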
-- - *Age*: Numerical, Discrete -- - *Gender*: Categorical, Nominal -- - *Weight*: Numerical, Continuous -- - *Location_type*: Categorical, Nominal -- - *Diabetes*: Categorical, Nominal -- - *Mother_diabetes*: Categorical, Nominal -- - *Difficulty_walking*: Categorical, Ordinal --- layout: false class: inverse, middle # Summarizing Data ## Graphical summaries --- ### Categorical variables ###Frequency Tables **Frequency table**: present the absolute frequencies (counts) and relative frequencies (shares) of each category. - Categories are mutually exclusive and collectively exhaustive - Relative frequency of category `\(i\)`: `\(p_i=\frac{n_i}{N}\)` - `\(n_i\)` is count of category `\(i\)` - `\(N\)` is total count in the sample .pull-left[
] .pull-right[
] --- ###Bar Charts **Bar charts** visually represent the frequency count of each category .center[ <!-- --> ] --- ###Bar Charts **Bar charts** visually represent the frequency count of each category .center[ <!-- --> ] --- ### More Creative Bar Chart <center> <img src=Bar_chart_food_poisoning.png width="800"> </center> --- ###Pie Charts **Pie chart**: Each slice is proportional to the category's frequency .center[
] --- ###Pie Charts **Pie chart**: The angle of each slice is proportional to the category's frequency .center[
] --- ### My favorite pie chart <center> <img src=Netflix_pie_chart.jpg width="800"> </center> --- ### Frequency Distribution Suppose we ask people aged 30-50 how many partners they have had in their life. - What's the distribution of the number of partners? - Calculate relative frequencies - Show them on a bar graph .pull-left[ #### Data
] .pull-right[ ####Distribution <img src="C_2_slides_d_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> ] --- ### Frequency Distribution We can also show frequency of age of people who have diabetes from our data <img src="C_2_slides_d_files/figure-html/unnamed-chunk-15-1.png" width="100%" /> --- ### Frequency Distribution Compare it to the age distribution in the adult population (20+) <img src="C_2_slides_d_files/figure-html/unnamed-chunk-16-1.png" width="100%" /> --- ## Numerical Variables: Continuous - What about continuous values? Why can't we do the same? .pull-left[ <!-- --> ] .pull-right[
] - Most values never repeat, so they have very low relative frequency --- ## Histograms **Solution**: Group similar values together - Construct intervals and show how many observations are in a given interval -- **Process** 1. Decide how many intervals 2. And how wide they are 3. Then calculate the absolute and relative frequencies of each interval 4. Plot it with bars -- --- **My approach** - I want `\(k\)` (example `\(k\)`=5) equal intervals -- - Divide the range of the data into `\(k\)` equal intervals -- - *Range* is max-min of the data -- ``` r # Calculate max and min max_value <- max(Health_data$weight) min_value <- min(Health_data$weight) # Calculate the difference range <- max_value - min_value ``` -- ``` ## [1] "Range= 190.8078 - 30.3745 = 160.4333" ``` -- - With 5 intervals, each will be 32kg wide -- - The first one starts at the minimum value (30.3745) -- - The last one ends at the maximum value (190.8078) -- - Calculate how many observations I have in each interval and what's the relative frequency --- ## Histograms - Midpoint represents middle of the interval - center of the bar - `\(P_i\)` is cumulative frequency: share of observations in this or smaller interval - *Example*: `\(P_{(62.46-94.55)}=0.911\)` - *Interpretation*: 91.1% of people have weight lower than 94.55kg .pull-left[ <!-- --> ] .pull-right[
] --- ## Histogram with 10 Classes Now, let's increase the number of classes to 10. .pull-left[ <!-- --> ] .pull-right[
] --- ## Histogram with 100 Classes .pull-left[ <!-- --> ] .pull-right[
] - Helps to see the distribution and outliers - Is more always better? - With smaller intervals, the histogram approaches the **probability density function** --- ## Probability Density Function (PDF) ### Definition - **Probability Density Function (pdf)** describes the probability distribution of a continuous random variable. - It **does not** give the probability of a given value (this is always 0 for a continuous variable) - It shows in which intervals the variable appears most often - It is used to calculate the probability of the random variable being in a given interval - The area under it always equals 1 -- ### Example We have a random variable X representing the weight of adults in the Mexican population. The PDF of X helps to describe the likelihood of finding a person whose weight falls within a given range (e.g., between 58kg and 60kg). --- ### How They Work To calculate the probability of X falling within a specific range [a, b], you need to integrate the PDF from a to b: `\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)` What is the share of population with weight between 65kg and 75kg?
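A sketch of this calculation in R. The true weight distribution is not given here, so the code assumes, purely for illustration, that weight is Normal with a hypothetical mean of 70kg and standard deviation of 10kg; the resulting number is therefore not the survey's answer.

``` r
# ASSUMPTION: weight ~ Normal(mean = 70, sd = 10); these parameters are hypothetical.
mu    <- 70
sigma <- 10
pdf_weight <- function(x) dnorm(x, mean = mu, sd = sigma)

# P(65 <= X <= 75): integrate the pdf numerically from a = 65 to b = 75
p_integral <- integrate(pdf_weight, lower = 65, upper = 75)$value

# Same probability using the cdf: F(75) - F(65)
p_cdf <- pnorm(75, mean = mu, sd = sigma) - pnorm(65, mean = mu, sd = sigma)

# Area under the pdf over mu +/- 10 sd, which is effectively the whole real line
total_area <- integrate(pdf_weight,
                        lower = mu - 10 * sigma,
                        upper = mu + 10 * sigma)$value

c(integral = p_integral, cdf = p_cdf, total = total_area)
```

Both routes give the same probability (about 0.38 under these assumed parameters), and the total area is 1, as the definition requires.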
--- ### How They Work To calculate the probability of X falling within a specific range [a, b], you need to integrate the PDF from a to b: `\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)` What is the share of population with weight between 40 and 50kg?
--- ### How They Work To calculate the probability of X falling within a specific range [a, b], you need to integrate the PDF from a to b: `\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)` What is the share of population with weight between 66.99 and 67 kg?
--- ### How They Work To calculate the probability of X falling within a specific range [a, b], you need to integrate the PDF from a to b: `\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)` What is the share of population with weight between 40 and 100 kg?
--- ## Distribution Shapes: Modality <img src="C_2_slides_d_files/figure-html/unnamed-chunk-31-1.png" width="100%" /> --- ## Which is uniformly distributed? 1. Weights of Adult Females 2. Salaries in Mexico 3. Airbnb prices in CDMX 4. Birthdays of Classmates (day of the month) --- <center> <img src=Distributions_q_data.png width="800"> </center> --- ## Distribution Shapes: Skewness <img src="C_2_slides_d_files/figure-html/unnamed-chunk-32-1.png" width="100%" /> --- ### Age at death <center> <img src=Age_at_death.jpeg width="800"> </center> --- #### What if we want to calculate the proportion of people who weigh 50kg or less? --- ## Cumulative Distribution Function (CDF) The .blue[Cumulative Distribution Function] (CDF) gives the probability that a random variable X will take on a value less than or equal to a specific value. For a continuous random variable X with PDF f(x), the CDF F(x) is defined as: `\(F(x) = \int_{-\infty}^{x} f(t) \, dt = P(X \leq x)\)` Characteristics: - The CDF approaches 0 as x approaches minus infinity (minimum) - It approaches 1 as x approaches infinity (maximum) - It is non-decreasing - It is right-continuous --- ## Example 1: Normal Variable (weight in the population) `\(F(50) = \int_{-\infty}^{50} f(t) \, dt = P(X \leq 50)=0.02\)` <!-- --> --- ## Example 2: Normal Variable (weight in the population) `\(F(72) = \int_{-\infty}^{72} f(t) \, dt = P(X \leq 72)=0.58\)` <!-- --> --- ## Example 3: Normal Variable (weight in the population) `\(F(102) = \int_{-\infty}^{102} f(t) \, dt = P(X \leq 102)=0.99\)` <!-- --> Never integrate a CDF: it is already the integral of the PDF! --- ### Empirical CDF What if we only have a sample and we don't know the true pdf? Intuition on how it comes up: <!-- --> <!-- --> --- ### Empirical CDF What if we only have a sample and we don't know the true pdf? 
Intuition on how it comes up: <!-- --> <!-- --> --- ### Empirical CDF `\(ECDF(x)=\frac{\sum_{i=1}^{N} I(w_i\leq x)}{N}=\frac{\text{Number of people with weight at most x}}{N}\)` <small> - `\(I(w_i\leq x)=1\)` if the weight of person `\(i\)` is at most `\(x\)`, and 0 otherwise (*Indicator Function*) - `\(N\)` is total number of people (*Sample Size*) - Share of people with weight at most x </small>
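The ECDF is simple to compute directly. A minimal sketch in R, using a simulated sample `w` as a stand-in for the survey weights (the sample and its parameters are hypothetical); `ecdf()` is the base R function that builds the same step function.

``` r
# Simulated stand-in for the sample of weights in kg (hypothetical values)
set.seed(42)
w <- rnorm(1000, mean = 70, sd = 10)

# ECDF by hand: the share of observations with w_i <= x.
# (w <= x) is the indicator; its mean is the sum of indicators divided by N.
ecdf_manual <- function(x, sample) mean(sample <= x)

# Built-in equivalent: ecdf() returns a function of x
F_hat <- ecdf(w)

ecdf_manual(60, w)   # share of the sample weighing at most 60kg
F_hat(60)            # same value from the built-in
1 - F_hat(85)        # share of the sample weighing more than 85kg
```

Because `F_hat` is an ordinary R function, it can also be plotted with `plot(F_hat)` to see the step shape.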
-- - So how do we calculate the share of people with a weight of at most 50kg? -- `\(P(weight\leq50)=ECDF(50)\)` -- - What about more than 100? -- `\(P(weight>100)=1-P(weight\leq100)=1-ECDF(100)\)` --- <center> <img src=Exam_q_dist.png width="800"> </center> - Is `\(F_x(X)\)` a valid distribution function? - What's the probability that the rent is larger than 10 000? --- --- layout: false class: inverse, middle # Summarizing Data ## Comparisons and Associations --- ##Comparisons - Descriptive and visual comparisons -- - NOT declaring statistically significant differences, just eyeballing -- - That's coming next --- ###Comparing categorical variables ####Are people living in rural areas more likely to have diabetes? - We have two categorical variables - We can use a frequency table to see how diabetes is distributed among the two types of areas: <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Rural </td> <td style="text-align:right;"> 8906 </td> <td style="text-align:right;"> 993 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 24780 </td> <td style="text-align:right;"> 3179 </td> </tr> </tbody> </table> --- ###Comparing categorical variables ####Are people living in rural areas more likely to have diabetes? - Are relative frequencies more helpful? 
- Share of each subgroup within the sample <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Rural </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.27 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 0.65 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.73 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 0.89 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 1.00 </td> </tr> </tbody> </table> -- - Can we compare numbers in the *Has Diabetes* column? -- - **Marginal frequencies** are total probabilities by group --- #### Table of frequency - We want to compare whether someone living in rural area is more likely to have diabetes than someone living in urban area -- - So we want to see whether: $$ \scriptsize P(Diabetes_i=1|Area_i=Rural)>P(Diabetes_i=1|Area_i=Urban)$$ -- - We want to look at the **relative conditional frequencies** - They are usually in **contingency tables** - Share with diabetes within urban sample - Share with diabetes within rural sample -- <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Rural </td> <td style="text-align:right;"> 0.90 </td> <td style="text-align:right;"> 0.10 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> 
<td style="text-align:right;"> 0.89 </td> <td style="text-align:right;"> 0.11 </td> </tr> </tbody> </table> $$ \scriptsize P(Diabetes_i=1|Area_i=Rural)=\scriptsize \frac{ P(Diabetes_i=1 \cap Area_i=Rural) }{P(Area_i=Rural)}\approx\frac{0.03}{0.03+0.24} \approx 0.1$$ Or: $$ \scriptsize P(Diabetes_i=1|Area_i=Rural)=\scriptsize \frac{ \text{Number living in rural areas and having diabetes} }{\text{Number living in rural areas}}=\frac{993}{993+8906} \approx 0.1$$ --- <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Rural </td> <td style="text-align:right;"> 0.90 </td> <td style="text-align:right;"> 0.10 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 0.89 </td> <td style="text-align:right;"> 0.11 </td> </tr> </tbody> </table> -- - What about marginal frequencies here? - Row sums should add up to 1 - `\(\scriptsize P(Diabetes_i=1|Area_i=Rural)+P(Diabetes_i=0|Area_i=Rural)=1\)` - Column sums are meaningless - `\(\scriptsize P(Diabetes_i=1|Area_i=Rural)+P(Diabetes_i=1|Area_i=Urban)\)` --- - We can visualize it on a barplot .center[
] --- - Or better on a **stacked barplot** .center[
] - A *stacked barplot* clearly shows the distribution of diabetes within each group --- ###Practice - Are you more likely to have diabetes if your mother had diabetes? - By how much? <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mother No Diabetes </td> <td style="text-align:right;"> 25270 </td> <td style="text-align:right;"> 2427 </td> </tr> <tr> <td style="text-align:left;"> Mother Has Diabetes </td> <td style="text-align:right;"> 8283 </td> <td style="text-align:right;"> 1721 </td> </tr> </tbody> </table> --- ###Practice <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> No Diabetes </th> <th style="text-align:right;"> Has Diabetes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mother No Diabetes </td> <td style="text-align:right;"> 0.91 </td> <td style="text-align:right;"> 0.09 </td> </tr> <tr> <td style="text-align:left;"> Mother Has Diabetes </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.17 </td> </tr> </tbody> </table> - Does it mean that having a diabetic mother **causes** a higher chance of having diabetes? --- ### One quantitative and one categorical - For quantitative variables we can compare some summary statistics - Are people with diabetes older than people without it? -- - *Example*: means in two subpopulations
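Group means like these can be computed with `tapply()` or `aggregate()` in base R. The sketch below uses a simulated data frame `health` with hypothetical columns `diabetes` and `age`; the numbers it produces are not the survey's results.

``` r
# Simulated stand-in for the survey (hypothetical values and column names)
set.seed(7)
health <- data.frame(
  diabetes = rep(c("No", "Yes"), times = c(60, 40)),
  age      = c(rnorm(60, mean = 42, sd = 12),
               rnorm(40, mean = 58, sd = 10))
)

# Mean age within each subpopulation
group_means <- tapply(health$age, health$diabetes, mean)
group_means

# Several summary statistics per group at once
group_summary <- aggregate(
  age ~ diabetes, data = health,
  FUN = function(a) c(mean = mean(a), median = median(a), sd = sd(a))
)
group_summary
```

As the slides stress, a gap between the two means is only a descriptive comparison; whether it is statistically significant is a separate question.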
--- ### One quantitative and one categorical - Or we can do Box and Whiskers plots as before - Or we can compare the whole distributions of frequencies <img src="C_2_slides_d_files/figure-html/2g-1.png" width="100%" /> --- #### One quantitative and one categorical - For continuous variables we can use the same methods (except frequency distribution) - Instead, we can compare densities or histograms - Are men heavier than women? -- .center[ <img src="C_2_slides_d_files/figure-html/2ha-1.png" width="100%" /> ] --- ### Associations: Two Quantitative Variables - People would likely subscribe to the website to lose weight -- - But do these people have resources? -- - What is the relationship between Body Mass Index (BMI) and Income? -- - More generally, how to measure .blue[association between two quantitative variables] -- - Association between qualitative variables is measured with contingency tables --- ### Associations - Suppose we surveyed people from Guadalajara and CDMX about their .blue[BMI], .blue[education] and .blue[income]. - Scatter plots show associations between two quantitative variables - We put the variables of interest (*example*: Y and X) on the axes - We place each observation on the Cartesian plane using its values of X and Y: `\(\{(x_1,y_1),(x_2,y_2),\ldots\}\)` - In our case: - X axis is BMI - Y axis is Income - An individual `\(i\)` is placed on these axes based on `\((BMI_i, Income_i)\)`
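A scatter plot of this kind takes one call to `plot()` in base R. The vectors `bmi` and `income` below are simulated independently of each other, so the picture is only a hypothetical illustration of the layout, not of the survey's relationship.

``` r
# Simulated data (hypothetical): the two variables are drawn independently,
# so no real relationship is implied.
set.seed(3)
n      <- 200
bmi    <- rnorm(n, mean = 27, sd = 4)
income <- rlnorm(n, meanlog = 9.5, sdlog = 0.5)

# Each point is one individual i placed at (BMI_i, Income_i)
plot(bmi, income,
     xlab = "BMI", ylab = "Income (MXN)",
     main = "Income vs. BMI (simulated)",
     pch  = 19, col = rgb(0, 0, 1, alpha = 0.4))
```

The semi-transparent colour is a small design choice: overlapping points show up darker, which keeps denser regions readable.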
---
--- ### Associations - Scatterplots become very messy if you have a lot of observations
--- ### Associations - If n is large, it is better to use a binscatter: - Group the x variable into quantiles (e.g., 10 deciles) - Calculate the average of y in each decile - Plot <img src="C_2_slides_d_files/figure-html/2zf-1.png" width="90%" /> ``` ## Call: binsreg ## ## Binscatter Plot ## Bin/Degree selection method (binsmethod) = User-specified ## Placement (binspos) = Quantile-spaced ## Derivative (deriv) = 0 ## ## Group (by) = Full Sample ## Sample size (n) = 10000 ## # of distinct values (Ndist) = 3214 ## # of clusters (Nclust) = NA ## dots, degree (p) = 0 ## dots, smoothness (s) = 0 ## # of bins (nbins) = 10 ``` --- ### Associations - Would you say that the relationship is stronger in Guadalajara or in Mexico City? <img src="C_2_slides_d_files/figure-html/2h-1.png" width="80%" /> - How to measure the strength of the relationship? --- ### Associations #### Covariance - **Covariance** measures the strength of the relationship between two variables. `$$\text{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$` And its sample equivalent is: `$$\hat{\text{Cov}}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$` -- - Covariance shows whether the two variables move together -- - Covariance increases when: - The relationship is stronger - The deviations of variables are larger --- <iframe src="https://shiny.rit.albany.edu/stat/rectangles/" width="100%" height="550px" data-external="1"></iframe> .footmark[ Source: [https://shiny.rit.albany.edu/stat/rectangles/](https://shiny.rit.albany.edu/stat/rectangles/) ] --- ### Covariance <img src="C_2_slides_d_files/figure-html/2zz-2.png" width="100%" /> --- ### Covariance - What has a stronger relationship with Income: BMI or Years of Education? 
<img src="C_2_slides_d_files/figure-html/2zzza-1.png" width="100%" /> - BMI has larger covariance -- - But we can't compare covariances of different variables -- - Covariance depends on the scales (or units) of the variable -- - All else equal, larger standard deviation implies larger covariance - The squares are just bigger --- ### Reminder We often use it to calculate variance of a sum or difference of two random variables `$$Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)$$` `$$Var(X-Y)=Var(X)+Var(Y)-2Cov(X,Y)$$` Reminder: if a is a constant `$$E(aX)=aE(X) \quad and \quad E(a+X)=E(X)+a$$` And `$$E(X+Y)=E(X)+E(Y)$$` More on that in the homework! --- ### Correlation - **Correlation measures** the strength of a linear relationship between two variables. - It ranges between -1 and 1 **Population Correlation coefficient**: `$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$` **Sample Correlation coefficient**: `$$\hat{\rho}(X, Y) = \frac{\hat{\text{Cov}}(X, Y)}{s_X \cdot s_Y}$$` Where `\(s_X = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)` --- ### Correlation - Correlation is preferred over covariance because it's **scale-independent** and easier to interpret. - Suppose that instead of measuring income (Y variable) in MXN , we measure it in Dollars. - `\(Z\)` income in dollars `\(Z=\frac{Y}{16}\)` -- - Is `\(Cov(X,Z)=Cov(X,Y)\)`? `\begin{align*} cov(X,Z) &=\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(z_i - \mu_Z) \\ &=\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(\frac{y_i}{16}- \frac{\mu_Y}{16}) \\ &=\frac{1}{16}\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i- \mu_Y) \\ & \neq cov(X,Y) \end{align*}` --- ### Correlation - Correlation is preferred over covariance because it's **scale-independent** and easier to interpret. - Suppose that instead of measuring income (Y variable) in MXN , we measure it in Dollars. - `\(Z\)` income in dollars `\(Z=\frac{Y}{16}\)` -- - Is `\(\rho(X,Z)=\rho(X,Y)\)`? 
`\begin{align*} \rho(X,Z) &=\frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(z_i - \mu_Z)}{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_X)^2} \cdot \sqrt{\frac{1}{N}\sum_{i=1}^{N} (z_i - \mu_Z)^2}} \\ &=\frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(\frac{y_i}{16}- \frac{\mu_Y}{16})}{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_X)^2} \cdot \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\frac{y_i}{16} - \frac{\mu_Y}{16})^2}} \\ &=\frac{\frac{1}{16} \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i- \mu_Y)}{\frac{1}{16} \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_X)^2} \cdot \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \mu_Y)^2}} \\ & = \rho(X,Y) \end{align*}` --- ### Correlation - Correlation with education is actually stronger <img src="C_2_slides_d_files/figure-html/2zzz-1.png" width="100%" /> ---
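The change-of-units argument can also be checked numerically. A minimal sketch in R with simulated data: `x` plays the role of BMI and `y` of income in MXN, and both the values and the relationship between them are hypothetical. Only the factor 16 comes from the slides.

``` r
# Simulated data (hypothetical): x = BMI, y = income in MXN
set.seed(11)
x <- rnorm(500, mean = 27, sd = 4)
y <- 20000 - 300 * x + rnorm(500, sd = 5000)

# Income in USD, using the slides' conversion Z = Y / 16
z <- y / 16

cov_xy <- cov(x, y)   # covariance with income in MXN
cov_xz <- cov(x, z)   # covariance with income in USD: scaled by 1/16
cor_xy <- cor(x, y)   # correlation with income in MXN
cor_xz <- cor(x, z)   # correlation with income in USD: unchanged

c(cov_xy = cov_xy, cov_xz = cov_xz, ratio = cov_xy / cov_xz)
c(cor_xy = cor_xy, cor_xz = cor_xz)
```

The covariance ratio comes out as 16 while the two correlations coincide, matching the algebra: the factor 1/16 appears in both the numerator and the denominator of the correlation and cancels.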
--- ### Correlation 1. Correlation is a value between -1 and 1: `\(-1 \leq \rho(X, Y) \leq 1\)`. -- 2. Perfect positive correlation: `\(\rho = 1\)`. Perfect negative correlation: `\(\rho = -1\)`. -- 3. No linear correlation: `\(\rho = 0\)`, but this doesn't imply independence. -- 4. Correlation measures **linear** relationships; nonlinear relationships might not be accurately captured. -- 5. Correlation doesn't imply causation; a relationship could be coincidental. --- ### Causality vs Correlation <center> <img src=Trump.jpg width="800"> </center> --- ### Causality vs Correlation <iframe src="https://tylervigen.com/spurious-correlations" width="100%" height="550px" data-external="1"></iframe> .footmark[ Source: [https://tylervigen.com/spurious-correlations](https://tylervigen.com/spurious-correlations) ] --- ### Causality vs Correlation - Less obvious examples - You look at historical data from some media campaign - You notice that people who were more exposed to ads were less likely to buy that product - What can you conclude? -- - Are people who were exposed to ads similar to people who were not? -- - Maybe they were targeted in the first place because they were less likely to buy and the campaign aimed to change that? --- ### Causality vs Correlation - Less obvious examples - Education usually correlates with Income (correlation) - Does it mean that if you decide to get a degree, you will earn more? (causality)